Authors: Mauro Venticinque | Angelo Schillaci | Daniele Tambone
GitHub project: Bank-Marketing
Date: 2025-05-20
In this project, we analyze data from a Portuguese banking institution’s direct marketing campaigns to identify key factors influencing customer subscription to term deposits.
A deposit account is a bank account maintained by a financial institution into which a customer can deposit and from which they can withdraw money. Deposit accounts include savings accounts, current accounts, and several other types, such as the term deposits studied here.
The dataset includes client demographics, previous campaign interactions, and economic indicators. Our goal is to develop insights that will enhance the effectiveness of future marketing strategies. By applying supervised learning techniques, we aim to predict customer responses and optimize outreach efforts for better engagement and conversion rates.
The report will begin with an Exploratory Data Analysis, examining the variables and their relationship with the target attribute (subscribed) to identify the most influential factors.
- age (Integer): age of the customer
- job (Categorical): occupation
- marital (Categorical): marital status
- education (Categorical): education level
- default (Binary): has credit in default?
- housing (Binary): has housing loan?
- loan (Binary): has personal loan?
- contact (Categorical): contact communication type
- month (Categorical): last contact month of year
- day_of_week (Integer): last contact day of the week
- duration (Integer): last contact duration, in seconds. Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'), yet the duration is not known before a call is performed, and after the call ends y is obviously known. This input should therefore only be included for benchmark purposes and discarded if the intention is to build a realistic predictive model.
- campaign (Integer): number of contacts performed during this campaign and for this client (includes last contact)
- pdays (Integer): number of days that passed since the client was last contacted in a previous campaign (999 means the client was not previously contacted)
- previous (Integer): number of contacts performed before this campaign and for this client
- poutcome (Categorical): outcome of the previous marketing campaign ('failure', 'nonexistent', 'success')
- subscribed (Binary): has the client subscribed to a term deposit?

Source: UCI Machine Learning Repository.

Note: our dataset does not include the bank balance variable.
| Name | train |
|---|---|
| Number of rows | 32950 |
| Number of columns | 21 |
| Column type frequency: | |
| character | 11 |
| numeric | 10 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| job | 0 | 1 | 6 | 13 | 0 | 12 | 0 |
| marital | 0 | 1 | 6 | 8 | 0 | 4 | 0 |
| education | 0 | 1 | 7 | 19 | 0 | 8 | 0 |
| default | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| housing | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| loan | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| contact | 0 | 1 | 8 | 9 | 0 | 2 | 0 |
| month | 0 | 1 | 3 | 3 | 0 | 10 | 0 |
| day_of_week | 0 | 1 | 3 | 3 | 0 | 5 | 0 |
| poutcome | 0 | 1 | 7 | 11 | 0 | 3 | 0 |
| subscribed | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 40.04 | 10.45 | 17.00 | 32.00 | 38.00 | 47.00 | 98.00 | ▅▇▃▁▁ |
| duration | 0 | 1 | 258.66 | 260.83 | 0.00 | 102.00 | 180.00 | 318.00 | 4918.00 | ▇▁▁▁▁ |
| campaign | 0 | 1 | 2.57 | 2.77 | 1.00 | 1.00 | 2.00 | 3.00 | 43.00 | ▇▁▁▁▁ |
| pdays | 0 | 1 | 961.90 | 188.33 | 0.00 | 999.00 | 999.00 | 999.00 | 999.00 | ▁▁▁▁▇ |
| previous | 0 | 1 | 0.17 | 0.49 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | ▇▁▁▁▁ |
| emp.var.rate | 0 | 1 | 0.08 | 1.57 | -3.40 | -1.80 | 1.10 | 1.40 | 1.40 | ▁▃▁▁▇ |
| cons.price.idx | 0 | 1 | 93.57 | 0.58 | 92.20 | 93.08 | 93.75 | 93.99 | 94.77 | ▁▆▃▇▂ |
| cons.conf.idx | 0 | 1 | -40.49 | 4.63 | -50.80 | -42.70 | -41.80 | -36.40 | -26.90 | ▅▇▁▇▁ |
| euribor3m | 0 | 1 | 3.62 | 1.74 | 0.63 | 1.34 | 4.86 | 4.96 | 5.04 | ▅▁▁▁▇ |
| nr.employed | 0 | 1 | 5167.01 | 72.31 | 4963.60 | 5099.10 | 5191.00 | 5228.10 | 5228.10 | ▁▁▃▁▇ |
The dataset includes 21 variables and 32,950 rows, with no
missing values.
Categorical variables like job and
education show good diversity, while
default, loan, and
housing have only 3 unique values.
Among numeric variables, age has a fairly normal distribution (mean ≈ 40, sd ≈ 10). duration is highly right-skewed, with extreme values up to 4918 seconds, while pdays is dominated by the sentinel value 999, which marks clients who were never contacted before.
Some variables (e.g., campaign,
previous) have a low median but long tails, indicating
that most observations are clustered at low values.
Macroeconomic variables such as emp.var.rate,
euribor3m, and nr.employed are more
stable, with tight interquartile ranges, suggesting consistent economic
conditions during data collection.
First, we note that the dataset is unbalanced: the majority of clients did not subscribe.
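The imbalance can be quantified with class proportions. A sketch on a toy vector (in the report this would be `prop.table(table(train$subscribed))` on the actual data):

```r
# Toy illustration: class proportions for a binary target.
# The 90/10 split here is illustrative, not the real dataset's ratio.
subscribed <- c(rep("no", 9), rep("yes", 1))
round(prop.table(table(subscribed)), 2)
```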
Correlation Matrix
The correlation matrix reveals clear patterns among the numerical variables. Notably, euribor3m, nr.employed, and emp.var.rate are strongly positively correlated with each other, suggesting that these variables capture similar information about the economic environment. This should be taken into account in predictive modeling, as using them together could lead to multicollinearity. In contrast, variables like campaign, pdays, and previous show very weak correlations with most other features, indicating they may contribute more independently to the model.
Scatterplot Matrix by Target
Several variables, such as duration and pdays, show highly skewed distributions, which could hurt model performance and may benefit from transformations (e.g., log or binning). While some variable pairs exhibit linear trends (e.g., euribor3m vs nr.employed), many scatterplots show dispersed or nonlinear patterns, suggesting that simple linear models may not fully capture the complexity in the data.
In certain plots, the blue points (subscribed) are concentrated in specific areas, showing the key factors that influenced successful subscriptions.
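As an example of the suggested transformation, a log transform compresses the long right tail. A sketch on simulated skewed data (not the actual duration column); log1p is used because duration can be 0, which plain log cannot handle:

```r
# Toy right-skewed "durations": exponential with mean roughly 258,
# mimicking the shape (not the values) of the real duration column.
set.seed(1)
x <- rexp(1000, rate = 1 / 258)

# Sample skewness: third standardized moment.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# The log1p transform brings the skewness much closer to zero in magnitude.
c(skew_raw = skewness(x), skew_log = skewness(log1p(x)))
```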
Box plot of age
Clients who declined are concentrated at younger ages; a "no" becomes comparatively rarer among older clients.
Box plot of emp.var.rate
Text
Box plot of euribor3m
Text
Distribution of Age
The age distribution is right-skewed, with a peak around 30–40 years old. The proportion of subscribers is higher among those over 60, which may reflect greater financial stability in older age groups.
Distribution of Job
The distribution of occupations is not uniform: admin. is by far the most common job, and admin. clients account for a large share of subscriptions, possibly because they tend to have higher incomes. Students and retired people, however, show the highest subscription rates, which is consistent with the earlier finding that older clients and those with higher education levels are more likely to subscribe.
Distribution of Education
Regarding education level, the distribution is not uniform: a university degree is the most common level, and university graduates show one of the highest subscription rates among all education levels, plausibly because they tend to have higher incomes.
Distribution of Marital Status
Text.
Distribution of Contact
Text.
Distribution of Contacts
Regarding the previous campaign, most clients were not previously contacted, but the success rate is visibly higher among those who were contacted more than once or had a successful prior outcome. This suggests that prior engagement is positively associated with subscription, although such clients are a small part of the sample.
Distribution of Days of Week
The distribution of the last contact day of the week is roughly uniform, with Thursday slightly the most frequent. The subscription rate is highest when the last contact falls in the middle of the week.
Distribution of Months
By contrast, the distribution of the last contact month is far from uniform, with most contacts occurring in May. The subscription rate is highest when the last contact occurred in March, September, October, or December. This may be because people are more willing to subscribe when they have more disposable income, and less so during the summer.
Distribution of Duration
The duration of the last contact is right-skewed, with a peak around 0–100 seconds. The subscription rate is higher for longer calls, most likely because clients who stay on the phone longer are more interested in subscribing.
The Exploratory Data Analysis reveals several important insights into the factors that influence the likelihood of subscription in this dataset.
In summary, the analysis suggests that financial conditions, previous campaign interactions, and macroeconomic indicators are strong predictors of subscription behavior. Demographic factors such as age, occupation, and education level also contribute meaningfully to the outcome.
In the next section, we will use these EDA findings to conduct a preliminary screening of the most influential variables, based on the visual trends observed in the plots.
Based on the Exploratory Data Analysis (EDA), we selected only the most relevant variables.
To prepare the data for model training, we apply one-hot encoding, obtaining the following dataset:
| Variable | Type |
|---|---|
| age | int |
| single | bool |
| cellular | bool |
| low_call | bool |
| previous | int |
| negative_emp | bool |
| low_cpi | bool |
| high_cci | bool |
| low_euribor | bool |
| university | bool |
| p_course | bool |
| job_student | bool |
| job_retired | bool |
| job_admin | bool |
| month_sep | bool |
| month_oct | bool |
| month_dec | bool |
| month_mar | bool |
| p_failure | bool |
| p_success | bool |
| target | bool |
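The binary features above were derived by one-hot encoding categorical levels and thresholding numeric columns. A sketch of the pattern on toy data (the cut-off shown is hypothetical, not the one used in the report):

```r
# Toy rows with the original column types.
df <- data.frame(
  marital   = c("single", "married", "single"),
  contact   = c("cellular", "telephone", "cellular"),
  euribor3m = c(0.9, 4.9, 1.2)
)

# Dummy indicators (TRUE/FALSE), as in the encoded dataset.
df$single      <- df$marital == "single"
df$cellular    <- df$contact == "cellular"
df$low_euribor <- df$euribor3m < 2   # hypothetical threshold for illustration

df[, c("single", "cellular", "low_euribor")]
```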
# Fit the full logistic regression, then select predictors by stepwise AIC
full_model <- glm(target ~ ., data = full_df, family = binomial)
stepwise <- stepAIC(full_model, direction = "both", trace = FALSE)  # from MASS
vif(stepwise)  # from car: check multicollinearity among retained predictors
## single cellular low_call previous negative_emp low_cpi
## 1.161049 1.257740 1.015602 4.030389 4.797421 1.673660
## high_cci low_euribor university p_course job_student job_retired
## 1.356434 5.001553 1.257091 1.123516 1.172201 1.114729
## job_admin month_sep month_oct month_dec month_mar p_failure
## 1.209488 1.109976 1.074431 1.051192 1.039553 2.885715
## p_success
## 2.653624
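As a sanity check on these values, recall that VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. A base-R sketch on toy collinear data (not the report's predictors):

```r
# Two deliberately collinear predictors plus an independent one.
set.seed(42)
x1 <- rnorm(200)
x2 <- 0.9 * x1 + rnorm(200, sd = 0.5)   # strongly related to x1
x3 <- rnorm(200)

# VIF for x1: regress it on the remaining predictors.
r2 <- summary(lm(x1 ~ x2 + x3))$r.squared
1 / (1 - r2)   # well above 1, reflecting the collinearity with x2
```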
# predictors removed by stepwise selection
stepwise$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## target ~ age + single + cellular + low_call + previous + negative_emp +
## low_cpi + high_cci + low_euribor + emp_cat + university +
## p_course + job_student + job_retired + job_admin + month_sep +
## month_oct + month_dec + month_mar + p_failure + p_success
##
## Final Model:
## target ~ single + cellular + low_call + previous + negative_emp +
## low_cpi + high_cci + low_euribor + university + p_course +
## job_student + job_retired + job_admin + month_sep + month_oct +
## month_dec + month_mar + p_failure + p_success
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 32929 18541.01 18583.01
## 2 - emp_cat 0 0.000000000 32929 18541.01 18583.01
## 3 - age 1 0.009056746 32930 18541.02 18581.02
stepwise_formula <- target ~ single + cellular + low_call + previous + negative_emp +
low_cpi + high_cci + low_euribor + university + p_course +
job_student + job_retired + job_admin + month_sep + month_oct +
month_dec + month_mar + p_failure + p_success
set.seed(123)
df_no_target <- subset(full_df, select = -target)
fit_lasso <- glmnet(x = as.matrix(df_no_target),
                    y = full_df$target,
                    alpha = 1,   # alpha = 1 gives the LASSO penalty
                    family = "binomial")
## Warning in storage.mode(xd) <- "double": NAs introduced by coercion
cv_fit <- cv.glmnet(
  x = as.matrix(df_no_target),
  y = full_df$target,
  alpha = 1,
  family = "binomial"
)
## Warning in storage.mode(xd) <- "double": NAs introduced by coercion
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
## (warnings repeated for each cross-validation fold)
plot(cv_fit)
# predictors selected by Lasso
coef(cv_fit, s = "lambda.1se")
## 22 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -3.40260756
## age .
## single .
## cellular 0.26429155
## low_call .
## previous .
## negative_emp 0.33148264
## low_cpi -0.15716970
## high_cci 0.54539233
## low_euribor 1.22932991
## emp_cat .
## university 0.02599663
## p_course .
## job_student 0.27874368
## job_retired 0.29253325
## job_admin .
## month_sep 0.25751717
## month_oct 0.59082454
## month_dec 0.28825879
## month_mar 0.96303668
## p_failure -0.12024151
## p_success 1.54745367
lasso_formula <- target ~ cellular + negative_emp +
low_cpi + high_cci + low_euribor + university +
job_student + job_retired + month_sep + month_oct +
month_dec + month_mar + p_failure + p_success
lasso_mod<-glm(lasso_formula, data=full_df, family=binomial)
summary(stepwise)
##
## Call:
## glm(formula = target ~ single + cellular + low_call + previous +
## negative_emp + low_cpi + high_cci + low_euribor + university +
## p_course + job_student + job_retired + job_admin + month_sep +
## month_oct + month_dec + month_mar + p_failure + p_success,
## family = binomial, data = full_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.12030 0.10037 -41.052 < 2e-16 ***
## single 0.08955 0.04463 2.007 0.04477 *
## cellular 0.48749 0.05437 8.966 < 2e-16 ***
## low_call 0.28095 0.09007 3.119 0.00181 **
## previous 0.11396 0.05858 1.946 0.05171 .
## negative_emp 0.54235 0.09230 5.876 4.20e-09 ***
## low_cpi -0.55291 0.05180 -10.674 < 2e-16 ***
## high_cci 0.70524 0.04530 15.568 < 2e-16 ***
## low_euribor 1.40709 0.09007 15.622 < 2e-16 ***
## university 0.13424 0.04578 2.932 0.00336 **
## p_course 0.10243 0.06164 1.662 0.09655 .
## job_student 0.48616 0.10267 4.735 2.19e-06 ***
## job_retired 0.52247 0.07832 6.671 2.54e-11 ***
## job_admin 0.10922 0.04752 2.298 0.02153 *
## month_sep 0.32774 0.10954 2.992 0.00277 **
## month_oct 0.77146 0.09676 7.973 1.55e-15 ***
## month_dec 0.79355 0.18087 4.387 1.15e-05 ***
## month_mar 1.12505 0.10898 10.323 < 2e-16 ***
## p_failure -0.58715 0.09471 -6.199 5.67e-10 ***
## p_success 1.26651 0.11283 11.225 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23199 on 32949 degrees of freedom
## Residual deviance: 18541 on 32930 degrees of freedom
## AIC: 18581
##
## Number of Fisher Scoring iterations: 6
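Logistic coefficients are on the log-odds scale; exponentiating gives odds ratios. For instance, the low_euribor estimate in the summary above implies roughly a four-fold increase in the odds of subscription, holding the other predictors fixed:

```r
# Odds ratio implied by the low_euribor coefficient (1.40709) above.
exp(1.40709)
# → about 4.08
```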
summary(lasso_mod)
##
## Call:
## glm(formula = lasso_formula, family = binomial, data = full_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.82821 0.05730 -66.814 < 2e-16 ***
## cellular 0.50192 0.05423 9.255 < 2e-16 ***
## negative_emp 0.56272 0.09214 6.108 1.01e-09 ***
## low_cpi -0.57685 0.05098 -11.314 < 2e-16 ***
## high_cci 0.73139 0.04488 16.296 < 2e-16 ***
## low_euribor 1.42629 0.08966 15.907 < 2e-16 ***
## university 0.15292 0.04187 3.652 0.00026 ***
## job_student 0.50976 0.09664 5.275 1.33e-07 ***
## job_retired 0.46140 0.07643 6.037 1.57e-09 ***
## month_sep 0.33778 0.10917 3.094 0.00197 **
## month_oct 0.78992 0.09678 8.162 3.30e-16 ***
## month_dec 0.77731 0.18053 4.306 1.66e-05 ***
## month_mar 1.14224 0.10900 10.479 < 2e-16 ***
## p_failure -0.44660 0.05939 -7.519 5.51e-14 ***
## p_success 1.43643 0.07412 19.379 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23199 on 32949 degrees of freedom
## Residual deviance: 18567 on 32935 degrees of freedom
## AIC: 18597
##
## Number of Fisher Scoring iterations: 6
# Compare the models
stepwise_results <- k_fold_mod(data = full_df, target_col = "target", model_formula = stepwise)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## (messages repeated for each of the 10 folds)
lasso_results <- k_fold_mod(data = full_df, target_col = "target", model_formula = lasso_mod)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## (messages repeated for each of the 10 folds)
print(stepwise_results)
## $Accuracy_at_best_threshold
## [1] "0.8991 (threshold = 0.53)"
##
## $F1_at_best_threshold
## [1] "0.4832 (threshold = 0.22)"
##
## $Sensitivity
## [1] 0.538
##
## $Specificity
## [1] 0.9127
##
## $AUC
## [1] 0.7811
##
## $AIC
## [1] 16724.85
print(lasso_results)
## $Accuracy_at_best_threshold
## [1] "0.899 (threshold = 0.5)"
##
## $F1_at_best_threshold
## [1] "0.4855 (threshold = 0.21)"
##
## $Sensitivity
## [1] 0.5374
##
## $Specificity
## [1] 0.9144
##
## $AUC
## [1] 0.7796
##
## $AIC
## [1] 16738.96
# Threshold evaluation
probs_stepwise <- predict(stepwise, type = "response")
probs_lasso <- predict(lasso_mod, type = "response")
res_step_05 <- evaluate_threshold(probs_stepwise, full_df$target, 0.5)
res_step_02 <- evaluate_threshold(probs_stepwise, full_df$target, 0.2)
res_lasso_05 <- evaluate_threshold(probs_lasso, full_df$target, 0.5)
res_lasso_02 <- evaluate_threshold(probs_lasso, full_df$target, 0.2)
# Combine all results into a list
results_list <- list(
Stepwise_0.5 = res_step_05,
Stepwise_0.2 = res_step_02,
LASSO_0.5 = res_lasso_05,
LASSO_0.2 = res_lasso_02
)
# Convert to a data.frame
results_df <- do.call(rbind, lapply(names(results_list), function(name) {
res <- results_list[[name]]
model <- sub("_.*", "", name)
threshold <- res$Threshold
data.frame(
Model = model,
Threshold = threshold,
Accuracy = res$Accuracy,
F1 = res$F1,
Sensitivity = res$Sensitivity,
Specificity = res$Specificity
)
}))
# Display the result
print(results_df)
## Model Threshold Accuracy F1 Sensitivity Specificity
## 1 Stepwise 0.5 0.8990 0.3182 0.2093 0.9865
## 2 Stepwise 0.2 0.8633 0.4833 0.5673 0.9009
## 3 LASSO 0.5 0.8990 0.3216 0.2126 0.9861
## 4 LASSO 0.2 0.8528 0.4676 0.5735 0.8883
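`evaluate_threshold` is a custom helper not shown in the report; a minimal sketch consistent with the metrics it returns (the function body here is an assumption, not the authors' code):

```r
# Sketch: classify probabilities at a cut-off and report the usual
# confusion-matrix metrics (accuracy, F1, sensitivity, specificity).
evaluate_threshold <- function(probs, actual, threshold) {
  pred <- as.integer(probs > threshold)
  tp <- sum(pred == 1 & actual == 1)
  tn <- sum(pred == 0 & actual == 0)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)   # sensitivity
  list(
    Threshold   = threshold,
    Accuracy    = (tp + tn) / length(actual),
    F1          = 2 * precision * recall / (precision + recall),
    Sensitivity = recall,
    Specificity = tn / (tn + fp)
  )
}

# Toy check: perfectly separated probabilities give all metrics equal to 1.
evaluate_threshold(c(0.1, 0.6, 0.8, 0.3), c(0, 1, 1, 0), 0.5)
```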
results_lda_full <- cv_lda_eval(full_df, target ~ ., model_name = "LDA_Full", k = 10)
## Warning in lda.default(x, grouping, ...): variables are collinear
## (warning repeated for each of the 10 folds)
results_lda_stepwise <- cv_lda_eval(full_df,stepwise_formula, model_name = "LDA_Stepwise", k = 10)
results_lda_lasso <- cv_lda_eval(full_df, lasso_formula, model_name = "LDA_LASSO", k = 10)
lda_results <- rbind(results_lda_full, results_lda_stepwise, results_lda_lasso)
aggregate(cbind(Accuracy, F1, Sensitivity, Specificity) ~ Model + Optimized_For, data = lda_results, mean)
## Model Optimized_For Accuracy F1 Sensitivity Specificity
## 1 LDA_Full Accuracy 0.89993 0.30377 0.19681 0.98907
## 2 LDA_LASSO Accuracy 0.89984 0.29692 0.18898 0.99008
## 3 LDA_Stepwise Accuracy 0.89999 0.30326 0.19630 0.98921
## 4 LDA_Full F1 0.86630 0.47438 0.53193 0.90867
## 5 LDA_LASSO F1 0.86551 0.47189 0.53333 0.90780
## 6 LDA_Stepwise F1 0.86541 0.47439 0.53515 0.90725
aggregate(Threshold ~ Model + Optimized_For, data = lda_results, mean)
## Model Optimized_For Threshold
## 1 LDA_Full Accuracy 0.867
## 2 LDA_LASSO Accuracy 0.888
## 3 LDA_Stepwise Accuracy 0.865
## 4 LDA_Full F1 0.159
## 5 LDA_LASSO F1 0.161
## 6 LDA_Stepwise F1 0.156
lda_lasso_model <- lda(lasso_formula, data = full_df)  # MASS::lda on the LASSO predictors
print(lda_lasso_model)
## Call:
## lda(lasso_formula, data = full_df)
##
## Prior probabilities of groups:
## 0 1
## 0.8873445 0.1126555
##
## Group means:
## cellular negative_emp low_cpi high_cci low_euribor university job_student
## 0 0.6122512 0.3732471 0.1958410 0.4123743 0.2781654 0.2894179 0.01631439
## 1 0.8332435 0.7685884 0.3922414 0.5660022 0.7206358 0.3617996 0.06115302
## job_retired month_sep month_oct month_dec month_mar p_failure
## 0 0.03543334 0.00844791 0.01084205 0.002667761 0.007250838 0.09990423
## 1 0.09671336 0.05711207 0.06627155 0.019935345 0.059267241 0.13011853
## p_success
## 0 0.01347561
## 1 0.19585129
##
## Coefficients of linear discriminants:
## LD1
## cellular 0.2532871
## negative_emp 0.2779495
## low_cpi -0.7380515
## high_cci 0.4582446
## low_euribor 1.3869866
## university 0.1056124
## job_student 0.6428778
## job_retired 0.5251612
## month_sep 0.9114776
## month_oct 1.3410837
## month_dec 1.5371912
## month_mar 1.9147897
## p_failure -0.4066345
## p_success 2.8641107
probs <- predict(lda_lasso_model)$posterior[, "1"]
pred_class <- ifelse(probs > 0.16, 1, 0)  # threshold close to the CV F1-optimal value
table(Predicted = pred_class, Actual = full_df$target)
## Actual
## Predicted 0 1
## 0 26388 1813
## 1 2850 1899
evaluate_threshold(probs, full_df$target, threshold = 0.16)
## $Threshold
## [1] 0.16
##
## $Accuracy
## [1] 0.8585
##
## $F1
## [1] 0.4489
##
## $Sensitivity
## [1] 0.5116
##
## $Specificity
## [1] 0.9025
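As a cross-check, the metrics above follow directly from the confusion matrix printed earlier (TN = 26388, FN = 1813, FP = 2850, TP = 1899):

```r
# Counts taken from the confusion matrix above.
tn <- 26388; fn <- 1813; fp <- 2850; tp <- 1899

accuracy    <- (tp + tn) / (tp + tn + fp + fn)   # 0.8585
sensitivity <- tp / (tp + fn)                    # 0.5116
specificity <- tn / (tn + fp)                    # 0.9025
precision   <- tp / (tp + fp)
f1 <- 2 * precision * sensitivity / (precision + sensitivity)  # 0.4489

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, f1 = f1), 4)
```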
roc_obj <- roc(full_df$target, probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_obj, main = "ROC Curve for LDA LASSO Model")
auc(roc_obj)
## Area under the curve: 0.7778
#results_qda_full <- cv_qda_eval(full_df, target ~ ., model_name = "QDA_Full", k = 10)
#results_qda_stepwise <- cv_qda_eval(full_df, stepwise_formula, model_name = "QDA_Stepwise", k = 10)
#results_qda_lasso <- cv_qda_eval(full_df, lasso_formula, model_name = "QDA_LASSO", k = 10)
#qda_results <- rbind(results_qda_full, results_qda_stepwise, results_qda_lasso)
#aggregate(cbind(Accuracy, F1, Sensitivity, Specificity) ~ Model + Optimized_For, data = qda_results, mean)
#aggregate(Threshold ~ Model + Optimized_For, data = qda_results, mean)
#print(qda_lasso_model)
#probs_qda <- predict(qda_lasso_model)$posterior[, "1"]
#pred_class_qda <- ifelse(probs_qda > 0.4, 1, 0)
#table(Predicted = pred_class_qda, Actual = full_df$target)
#evaluate_threshold(probs_qda, full_df$target, threshold = 0.4)
#roc_qda <- roc(full_df$target, probs_qda)
#plot(roc_qda, main = "ROC Curve for QDA LASSO Model")
#auc(roc_qda)
df_for_Tree <- full_df
df_for_Tree$target <- factor(ifelse(full_df$target == 1, "Yes", "No"))
set.seed(123)  # seed before the random split so the partition is reproducible
train_index <- createDataPartition(df_for_Tree$target, p = 0.8, list = FALSE)
train_set <- df_for_Tree[train_index, ]
test_set <- df_for_Tree[-train_index, ]
treeDf <- tree(target ~ ., train_set)
## Warning in tree(target ~ ., train_set): NAs introduced by coercion
summary(treeDf)
##
## Classification tree:
## tree(formula = target ~ ., data = train_set)
## Variables actually used in tree construction:
## [1] "low_euribor" "high_cci" "p_success"
## Number of terminal nodes: 4
## Residual mean deviance: 0.5752 = 15160 / 26360
## Misclassification error rate: 0.1014 = 2674 / 26361
plot(treeDf)
text(treeDf, pretty = 0)
#cvDf <- cv.tree(treeDf)
#cvDf
#plot(cvDf$size, cvDf$dev , type = "b")
set.seed(123)
rf_train <- randomForest(target ~ ., train_set, mtry = 7, importance = TRUE)
rf_train
##
## Call:
## randomForest(formula = target ~ ., data = train_set, mtry = 7, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 10.35%
## Confusion matrix:
## No Yes class.error
## No 22906 485 0.02073447
## Yes 2244 726 0.75555556
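The `class.error` column is the per-class misclassification rate from the OOB confusion matrix above; recomputing it shows how the class imbalance hurts the minority class:

```r
# Per-class OOB error rates from the confusion matrix above:
# "No" is almost always right, while most "Yes" cases are missed.
c(no  = 485 / (22906 + 485),   # ≈ 0.0207
  yes = 2244 / (2244 + 726))   # ≈ 0.7556
```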
tree_pred <- predict(rf_train, newdata = test_set)
table(tree_pred, test_set$target)
##
## tree_pred No Yes
## No 5723 558
## Yes 124 184
importance(rf_train)
## No Yes MeanDecreaseAccuracy MeanDecreaseGini
## age 25.2956110 12.1292203 31.3416991 506.93156
## single 23.0065081 -6.4563694 19.7977154 58.06339
## cellular 19.3375782 18.0844678 30.3248184 59.53520
## low_call 8.7337941 7.5815638 12.8268986 33.92329
## previous 7.1033205 7.0518287 8.6774367 108.54280
## negative_emp 11.9843440 10.2966356 13.0175552 65.60228
## low_cpi 61.2900304 -53.8018408 45.7776987 66.29689
## high_cci 39.2888282 38.7908040 46.9895493 281.63954
## low_euribor 15.5742123 23.4782191 23.0223018 199.56946
## emp_cat 12.3823131 11.8823441 13.8186300 78.34894
## university 8.6464326 0.6008189 9.0141606 66.04441
## p_course 1.6225574 6.3389898 5.1957953 47.67366
## job_student 12.0673720 16.4276156 19.2978405 25.59045
## job_retired 0.5329116 -1.2010725 -0.1877720 25.25449
## job_admin -1.1612488 9.0321440 5.0966691 59.93971
## month_sep 2.3396265 -1.1599812 1.5764949 29.91551
## month_oct 11.7808716 33.6707759 19.0857578 52.32859
## month_dec -16.4558116 19.7569315 0.9232797 16.22967
## month_mar -7.5072834 27.6167143 6.2854111 54.33978
## p_failure 12.6180629 -16.2779556 10.4101050 37.35659
## p_success 5.8042644 57.9513826 35.2105357 316.75321
varImpPlot(rf_train)
#x <- full_df[, !(names(df_for_Tree) %in% "target")]
#y <- full_df$target
#xtrain <- x[train_index, ]
#ytrain <- y[train_index]
#xtest <- x[-train_index, ]
#ytest <- y[-train_index]
#set.seed(123)
#bartfit <- lbart(xtrain, ytrain, x.test = xtest)
#yhat_bart <- bartfit$prob.test.mean
#table(yhat_bart, test_set$target)
#table(pred = yhat_bart > 0.5, actual = ytest)
Social and economic context attributes:

- emp.var.rate (Numeric): employment variation rate - quarterly indicator
- cons.price.idx (Numeric): consumer price index - monthly indicator
- cons.conf.idx (Numeric): consumer confidence index - monthly indicator
- euribor3m (Numeric): euribor 3 month rate - daily indicator
- nr.employed (Numeric): number of employees - quarterly indicator